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SUMMARY 

i--  An  efficient  algorithm  designed  to  be  used  for  N&vier-Stokes  simulations  of  complex  flows 
over  complete  configurations  is  described.  The  algorithm  incorporates  a  number  of  elements, 
including  an  explicit  three-dimensional  flow  solver,  embedded  mesh  refinements,  a  model 
equation  hierarchy  ranging  from  the  Euler  equations  through  the  full  Navier-Stokea  equa¬ 
tions,  multiple-grid  convergence  acceleration  and  extensive  vectorization  and  multitasking  for 
efficient  execution  on  parallelfprocesaing  supercomputers.  Results  are  presented  for  a  prelim¬ 
inary  trial  of  the  method  on  a  problem  representative  of  turbomachinery  applications.  Based 
on  this  performance  data,  it  is  estimated  that  a  mature  implementation  of  the  algorithm  will 
yield  overall  speedups  ranging  as  high  as  100. 

\ 

INTRODUCTION 

It  is  generally  recognized  that  a  comprehensive  approach  to  the  simulation  of  flows  involv¬ 
ing  both  complex  geometries  and  complex  physics  will  require  powerful  advanced-architecture 
supercomputers  with  very  large  memories.  Machines  capable  of  producing  solutions  to 
Reynolds-averaged  Navier-Stokes  flows  over  complex  geometries  within  computing  times 
short  enough  to  be  of  design  interest  are  expected  to  be  available  by  the  end  of  this  decade. 
In  order  to  use  these  parallel-processing  supercomputers  effectively,  algorithms  must  be 
adapted  to  focus  the  power  of  multiple  processing  units  on  a  single  flow  simulation.  The  pur¬ 
pose  of  this  work  is  to  contribute  to  the  development  of  such  algorithms.  The  approach 
selected  enhances  the  efficiency  of  a  robust  and  flexible  solution  procedure  by  implementing  it 
on  a  collection  of  local  meshes  embedded  in  a  global  mesh.  Either  the  Euler,  thin-layer 
Navier-Stokea  or  full  Navier-Stokes  equations  are  solved  on  each  mesh.  The  choice  of  model 
equations  is  determined  by  the  nature  of  the  flow  physics  to  be  resolved  on  a  particular  mesh. 
When  the  requirement  for  time  accuracy  is  relaxed,  a  convergence  acceleration  procedure  is 
applied  simultaneously  to  all  meshes  and  all  model  equations.  The  entire  algorithm  is  explicit 
and  is  designed  to  perform  well  on  computers  consisting  of  multiple  processing  units,  each 
having  vector  processing  capability.  Examples  of  such  machines  are  the  Cray  X-MP  and  Cray  2 


EQUATIONS  OF  MOTION 


The  nondimensional  equations  of  motion  may  be  written  in  conservation-law  form  as 
It  m 

where,  for  the  Reynolds-averaged  Navier-Stokes  equations, 
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Here  p,  u,  v,  w,  p  and  E  are  respectively  density,  velocity  components  in  the  x-,  y-  and  *- 
directions,  pressure  and  total  energy  per  unit  volume.  This  final  quantity  may  be  expreseed  as 


E  “  p 


e  +  -j-(u*  +  v*  +  »*) 


where  the  specific  internal  energy,  e,  is  related  to  the  pressure  and  density  by  the  simple  law 
of  a  calorically-perfect  gas 

P  m  (l  -  l)p« 

with  7  denoting  the  ratio  of  specific  heats.  The  coefficient  of  thermal  conductivity,  «,  and  the 


viscosity  coefficients,  A  and  p,  are  assumed  to  be  functions  only  of  temperature.  Further¬ 
more,  A  is  expressed  in  terms  of  the  dynamic  viscosity  p  by  invoking  Stokes’  assumption  of 
xero  bulk  viscosity.  Re  and  Pr  denote  the  Reynolds  and  Prandtl  numbers,  respectively. 
Although,  for  simplicity,  the  equations  of  motion  are  presented  here  written  in  Cartesian 
coordinates,  it  is  well  known  that  their  strong  conservation  law  form  may  be  maintained 
under  an  arbitrary  space-  and  time-dependent  transformation  of  coordinates. 

SOLUTION  METHODOLOGY 

The  integration  scheme  used  here  is  the  forward  predictor  -  backward  corrector  version  of 
the  two-step  Lax-Wendroff  method  due  to  MacCormack  ( 1] .  This  version  of  MacCormsck's 
scheme  is  used  for  convenience.  Any  of  its  many  variants  could  also  be  used,  as  could  any 
other  one-  or  two-step  Lax-Wendroff  scheme  [2],  In  fact,  the  class  of  fine-grid  methods  with 
which  the  convergence  acceleration  technique  described  below  may  be  applied  appears  to  be 
quite  large,  including  schemes  not  of  Lax-Wendroff  type  [3].  The  advantages  of 
MacCormack’s  method,  in  the  present  context,  are  its  explicit  nature,  simplicity  and  low 
operations  count.  A  disadvantage  is  its  conditional  stability  and  the  severe  time-step  size  limi¬ 
tation  which  this  imposes  for  viscous  flows,  in  particular.  The  ill  effects  of  conditional  stabili¬ 
ty  are  mitigated  through  the  use  of  embedded  grid  refinements  and  convergence  acceleration. 

The  embedded-mesh  technique  developed  for  the  present  application  is  a  generalisation  of 
that  employed  in  [4]  to  obtain  two-dimensional  Euler  solutions.  The  computational  domain  is 
divided  into  regions  requiring  grids  of  differing  fineness  and  the  resolution  of  different  Sow 
physics.  At  present,  for  simplicity,  this  partitioning  is  done  a-priori  However,  solution- 
adaptive  gridding  based  on  this  technique  is  possible.  Fig.  1  shows  typical  locations  for  the 
mesh  regions  employed  in  the  computations  described  subsequently  in  this  paper.  Note  that, 
where  mesh  lines  are  illustrated,  their  spacing  is  much  coarser  than  that  employed  in  the  com¬ 
putations.  Mesh  3,  the  coarsest  mesh  in  Fig.  1,  covers  the  entire  computational  domain.  The 
Euler  equations  are  solved  on  it.  Mesh  2,  finer  than  mesh  3  in  all  directions  by  a  factor  of 
two,  contains  the  regions  near  the  walls  and  the  blade  surface  where  flow  can  be  modeled  by 
the  thin-layer  form  of  the  equations  of  motion.  The  finest  mesh  shown,  mesh  1,  contains  the 
regions  of  the  domain  near  the  juncture  of  surfaces  where  all  viscous  terms  have  been  re¬ 
tained.  From  this  specific  example,  it  is  easy  to  see  that  quite  general  collections  of  embed¬ 
ded  meshes  may  be  constructed  in  this  manner.  The  embedded  meshes  are  not  disjoint. 
Rather,  given  a  mesh  labelled  m,  all  coarser  meshes  from  m-t-  1  through  the  coarsest  mesh 
used  in  the  computation  underlie  it.  This  property,  together  with  the  coarsening  factor  of  2, 
facilitate  the  use  of  the  multiple-grid  acceleration  techniques  described  in  (5).  The  fiowfield 
updating  begins  with  mesh  1.  After  one  timcstep  on  mesh  1,  mesh  2  is  updated  exterior  to 
mesh  1  while  convergence  acceleration  is  applied  at  the  mesh-2  points  interior  to  mesh  1. 
Next,  mesh  3  is  updated  exterior  to  mesh  2  while  convergence  acceleration  is  applied  at  the 


mesh- 3  point*  interior  to  mesh  2.  Updating  proceeds  in  this  fashion  until  the  global  mesh  has 
been  advanced  by  one  timestep.  Then,  convergence  acceleration  is  applied  to  coarsenings  of 
the  global  grid.  This  cycle  is  repeated  until  the  desired  measure  of  convergence  is  satisfied. 
Observe  that  when  both  the  basic  integration  scheme  and  the  coarse-grid  convergence  ac¬ 
celerator  are  explicit,  the  algorithm  is  particularly  easy  to  vectorize.  Additionally,  a  parallel 
coane-grid  algorithm  has  been  developed  for  more  efficient  execution  on  both  vector-  and 
parallel-processing  computers,  as  described  in  (5) .  Further,  note  that,  while  in  this  paper 
multiple-grid  convergence  acceleration  is  applied  only  to  steady  flow  simulations,  it  appears 
that  the  technique  may  extend  to  time-accurate  computation  of  some  unsteady  Bows  (0,  7). 

When  implementing  an  algorithm  on  a  multiple  instruction-multiple  data  machine,  we  are 
concerned  with  multitasking  overhead  and  algorithm  granularity.  By  granularity  we  mean  the 
time  required  to  execute  a  multitaskable  code  segment  on  a  single  processor.  For  a  given 
overhead,  the  best  speedup  is  obtained  when  algorithm  granularity  is  maximal.  Large  granu¬ 
larity  is  usually  introduced  by  top-down  programming  which  exploits  global  parallelism  in  the 
algorithm.  Bottom-up  programming  exploits  algorithm  parallelism  at  a  low  level  by  making 
many  partitionings,  each  on  small  code  segments,  such  as  DO  loops  containing  independent 
statements.  The  sequential  multigrid  algorithm  contains  many  opportunities  for  creating  small 
granularity  parallelism  but  relatively  few  for  the  sort  of  large  granularity  necessary  to  produce 
good  speedup  in  the  face  of  non-trivial  overhead.  This  observation,  together  with  the  desira¬ 
bility  of  non-sequential  multigrid  schemes  for  reasons  of  algorithm  flexibility,  led  to  the  con¬ 
struction  of  the  parallel  multigrid  algorithms  mentioned  above.  In  these  algorithms,  grids 
which  are  independent  of  one  another  may  be  updated  simultaneously  on  separate  processors. 
In  fact,  such  a  simple  strategy  may  result  in  a  poor  load  balance  across  processors  because  of 
the  differing  amounts  of  work  inherent  in  updating  grids  of  different  coarseness.  However, 
more  refined  strategies  are  possible.  Grids  may  be  grouped  together  into  tasks  of  approxi¬ 
mately  equal  work,  or  they  may  be  melded  into  tasks  with  other  large-grained  code  segments 
in  order  to  equalize  processor  loading.  Notice  futher  that,  by  multitasking  large-grained  struc¬ 
tures,  the  vectorization  potential  of  code  within  these  structures  remains  intact. 

NUMERICAL  SIMULATION 

As  the  algorithm  described  in  this  paper  is  designed  to  efficiently  simulate  complex  flows 
over  complete  configurations,  it  should  be  tested  under  conditions  which  fully  exercise  its 
capabilities.  On  the  other  hand,  excessive  complexity  would  serve  no  useful  purpose  in  the 
initial  testing  phases  of  the  algorithm.  With  these  considerations  in  mind,  three-dimensional 
computations  are  being  carried  out  for  the  geometry  illustrated  in  Fig.  2,  a  rectilinear  cascade 
of  finite-span,  swept  blades  mounted  between  endwalls.  The  sweep  angle  ranges  from  0  to  20 
degrees.  The  blade  thickness  to  chord  ratio  ranges  from  0.0  to  0.2.  The  subcritical  computa¬ 
tions  are  performed  at  an  isentropic  inlet  Mach  number  of  0.5.  The  Mach  number  for  the  su- 


percritical  computation!  ia  0.675.  In  the  viscous  cases,  the  Reynolds  numbers,  based  on  cas¬ 
cade  sap  and  critical  speed,  span  the  approximate  range  from  8.4  x  10 3  to  2.0  x  10®.  The 
mesh  structure  on  which  the  computations  are  being  performed  is  illustrated  in  Fig.  1.  The 
full  Navier-Stokes  equations  are  solved  on  mesh  1.  The  thin-layer  Navier-Stokes  equations 
are  solved  on  mesh  2.  The  Euler  equations  are  solved  on  mesh  3.  Only  steady  Sows  are 
computed  and  convergence  acceleration,  as  described  previously,  is  applied.  The  entire  algo¬ 
rithm  is  vectorised  and  multitasked  to  run  on  a  four-processor  Cray  X-MP  or  Cray  2.  Sample 
results  for  a  subcritical  flow  over  a  swept  blade  an  shown  in  Fig.  3. 

Comparison  of  the  embedded-mesh  algorithm  with  a  single-mesh  algorithm  yields  the  fol¬ 
lowing  conclusion:  the  accuracy  of  the  embedded-mesh  results  is  essentially  that  of  a  global 
finest  mesh,  while  the  convergence  rate  is  like  that  of  a  global  coarsest  mesh.  Thus  far,  in 
two-dimensional  computations  using  the  Euler  and  thin-layer  Navier-Stokes  equations  and 
three  mesh  regions,  embedding  spec  dupe  as  high  as  30  have  been  obtained.  Three- 
dimensional  embeddings  using  Euler,  thin-layer  and  full  Navier-Stokes  regions  should  pro¬ 
duce  substantially  larger  speedups. 

Multiple-grid  convergence  acceleration  applied  to  three-dimensional  casea,  in  the  absence 
of  mesh  embedding,  has  yielded  speedups  ranging  from  2.5  to  4.7.  It  is  expected  that  there 
will  be  some  tradeoff  between  embedding  and  multigrid  speedup  in  the  complete  algorithm. 
Vectorization  of  the  three-dimensional  algorithm  without  embedded  meshes  results  in  speed- 
ups  ranging  from  3.6  to  5.7.  This  range  should  remain  about  the  same  in  the  final  algorithm. 

Using  a  top-down  multitasking  approach,  the  parallel  coarae-grid  algorithm  has  been  im¬ 
plemented  on  a  four  processor  Cray  X-MP,  for  two-dimensional  casea  without  mesh  embed¬ 
ding.  Initially,  only  the  coarse  grids  were  multitasked  so  that  the  performance  of  parallel  grids 
on  a  multiprocessor  could  be  evaluated.  Then  the  fine-grid  computations  were  partitioned  and 
multitasked,  and  the  resultant  code  was  integrated  with  the  parallelised  coarse  grids. 

For  the  multitasking  results,  performance  measures  are  based  on  a  comparison  of  multi¬ 
tasked  code  segments  with  their  unitaskcd  analogs.  The  parallel  coaise-grid  scheme  results 
were  obtained  with  a  five-grid  multigrid  sequence  length.  An  efficiency  of  nearly  00%  has 
been  obtained  using  two  processors,  but  that  efficiency  deteriorates  to  77%  when  four  proces¬ 
sors  are  used.  This  deterioration  is  a  result  of  distributing  multigrid  structures  containing 
unequal  amounts  of  work  across  four  processors.  Results  obtained  from  multitasking  the 
fine-grid  scheme  show  that  the  fine-grid  tasks  are  fairly  evenly  balanced,  and  this  code  seg¬ 
ment  performs  well  on  both  two  and  four  processors.  Processor  utilisation  of  00%  or  better  is 
achieved.  The  fully  multitasked  two-dimensional  multigrid  algorithm  attains  efficiency  levels 
ranging  from  04%  on  two  processors  to  83%  on  four  processors.  For  the  three-dimensional 
Navier-Stokes  code,  the  bottom-up  approach  is  taken  by  using  microtasking  software  on  the 
Cray  X-MP.  Microtasking  incurs  relatively  low  overhead,  which  allows  parallelisation  of  very 
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fine-grained  code  tegmenta  and  alleviates  the  need  for  careful  a-priori  load  balancing.  The 
resulting  fully  micro  tasked  three-dimensional  code  performance  ranged  from  08%  efficiency 
on  two  processors  to  89%  on  four  processors. 

Given  that  the  speedups  from  the  various  categories  described  above  are  generally  multi¬ 
plicative  in  effect,  it  is  to  be  conservatively  estimated  that  a  mature  implementation  of  the  al¬ 
gorithm  will  produce  overall  speedups  ranging  as  high  as  100. 


CONCLUSIONS 


An  efficient  algorithm  designed  to  be  used  for  Navier-Stokes  simulations  of  complex  flows 
over  complete  configurations  has  been  presented. 

The  algorithm  makes  use  of  several  elements:  a  robust  explicit  basic  flow  solver,  locally- 
embedded  mesh  refinements,  a  flow  simulation  hierarchy  ranging  from  the  Euler  equations 
through  the  full  Navier-Stokes  equations,  an  explicit  multiple-grid  convergence  acceleration 
technique,  and  both  vectorization  and  multitasking  for  efficient  execution  on  parallel¬ 
processing  supercomputers. 

Results  are  presented  for  a  problem  representative  of  turbomachinery  applications.  These 
results  provide  grounds  for  optimism  regarding  the  algorithm’s  future  application  to  more 
challenging  internal  and  external  flows.  Based  on  the  performance  data  presently  available, 
this  algorithm  is  expected  to  reduce  simulation  times  by  as  much  as  two  orders  of  magnitude. 
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